Link to slides:
https://m-clark.github.io/workshops/dplyr/mainSlides.html
You will get a pop-under window on the lab machines!
As we go along, you can install other packages if you want to run the demos (not necessary)
Newer approaches to data wrangling
Nothin’s gonna stop the flow
Quick interactive visualizations
Let’s say we want to select the following from our data:
How might we go about this?
Typically tedious
Multiple steps just to get the columns you want
# numeric indexes; not conducive to readability or reproducibility
newData = oldData[,c(1,2,3,4, etc.)]
# explicitly by name; fine if only a handful; not pretty
newData = oldData[,c('ID','X1', 'X2', etc.)]
# two step with grep; regex difficult to read/understand
cols = c('ID', paste0('X', 1:10), 'var1', 'var2', grep('^XYZ', colnames(oldData), value=TRUE))
newData = oldData[,cols]
# or via subset
newData = subset(oldData, select = cols)

What if you also want observations where Z is Yes, Q is No, ordered by var1 (descending)… and only the last 50 of those results?
# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No',]
newData = newData[order(newData$var1, decreasing=TRUE),]
newData = tail(newData, 50)

And this is for fairly straightforward operations.
newData = oldData %>%
filter(Z == 'Yes', Q == 'No') %>%
select(num_range('X', 1:10), contains('var'), starts_with('XYZ')) %>%
arrange(desc(var1)) %>%
tail(50)

Piping is an alternative
You can do all this sort of stuff with base R
While the base R approach can be concise, it is potentially harder to read, reproduce, and extend
%>% : Passes the prior object to the function after the pipe
object %>% function() is equivalent to function(object)

wikiURL = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population'
# Let's go!
wikiURL %>%
read_html() %>% # parse the html
html_node(css='.wikitable.sortable') %>% # grab a class of object
html_table() %>% # convert table to data.frame
sapply(function(x) repair_encoding(as.character(x), 'UTF-8')) %>% # repair encoding; makes a matrix
data.frame() %>% # back to df
mutate(City = str_replace(City, '\\[(.*?)\\]', ''), # remove footnotes
latlon = sapply(str_split(Location, '/'), last), # split up location (3 parts)
latlon = str_extract_all(latlon, '[-|[0-9]]+\\.[0-9]+'), # grab any that start with - or number
lat = sapply(latlon, first), # grab latitudes
lon = sapply(latlon, nth, 2), # grab longitude
population2016 = as.numeric(str_replace_all(X2016.estimate, ',', '')), # remove commas from numbers (why do people do this?)
population2010 = as.numeric(str_replace_all(X2010.Census, ',', '')), # same for 2010
popDiff = round(population2016/population2010 - 1, 2)*100) %>% # create percentage difference
select(-latlon, -Location) %>% # remove stuff we wouldn't ever use
filter(as.numeric(as.character(X2016.rank)) <= 50) %>% # top 50
plot_geo(locationmode = 'USA-states', sizes = c(1, 250)) %>% # map with plotly
add_markers(x = ~lon, y = ~lat, size = ~abs(popDiff), hoverinfo='text',
color=~popDiff, colors='RdBu',
text=~hovertext, marker = list(opacity = 0.5)) %>%
layout(title = 'Largest US cities and their 2010-2016 change',
geo = g, paper_bgcolor='#fdf6e3')
# g is a list of map options, e.g. projection, line colors, etc.

For your own code, try a more concise approach
However, it serves as an illustration of what’s possible
Packages have been created to make data wrangling easier
We will focus on plyr, dplyr, and tidyr
But others, e.g. data.table, may be useful as well
Newer visualization packages work similarly
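For a taste of the alternative, data.table's bracket syntax chains operations in a similar spirit. A sketch only, reusing the hypothetical oldData, Z, Q, and var1 objects from the earlier base R examples:

```r
library(data.table)

# the hypothetical oldData from the earlier examples
oldDT = as.data.table(oldData)

# filter, then order, then take rows -- chained brackets
oldDT[Z == 'Yes' & Q == 'No'][order(-var1)][1:50]
```

data.table trades some readability for speed on large data; dplyr verbs tend to read more like a sentence.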
c('Ceci', "n'est", 'pas', 'une', 'pipe!') %>%
{
.. <- . %>%
if (length(.) == 1) .
else paste(.[1], '%>%', ..(.[-1]))
..(.)
}

[1] "Ceci %>% n'est %>% pas %>% une %>% pipe!"
Let’s get to it!
data %>%
  function

data %>%
  function(arg='blah')

Note that Ctrl+Shift+M is the shortcut to make the %>% pipe.
iris %>% summary

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
We often want to calculate new variables
Or perhaps make changes to ones we have
We can use mutate or transmute for this
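The difference between the two: mutate adds new columns to the data you give it, while transmute keeps only the columns you create. A quick sketch with mtcars (kpl is just an illustrative conversion):

```r
library(dplyr)

# mutate: all original columns plus the new one
mtcars %>%
  mutate(kpl = mpg * 0.4251) %>%   # rough miles/gallon to km/liter
  head

# transmute: only the newly created column(s)
mtcars %>%
  transmute(kpl = mpg * 0.4251) %>%
  head
```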
First, let’s scrape some data:
url = "http://www.basketball-reference.com/leagues/NBA_2015_totals.html"
bball = read_html(url) %>%
html_nodes("#totals_stats") %>%
html_table %>%
data.frame %>%
filter(Rk != "Rk")The data is currently all character strings
glimpse(bball[,1:5])

Observations: 651
Variables: 5
$ Rk <chr> "1", "2", "3", "4", "5", "5", "5", "6", "7", "8", "9", "10", "11", "12", "13", "13", "13", "14", "15...
$ Player <chr> "Quincy Acy", "Jordan Adams", "Steven Adams", "Jeff Adrien", "Arron Afflalo", "Arron Afflalo", "Arro...
$ Pos <chr> "PF", "SG", "C", "PF", "SG", "SG", "SG", "C", "PF", "C", "PF", "C", "SG", "SF", "PF", "PF", "PF", "C...
$ Age <chr> "24", "20", "21", "28", "29", "29", "29", "26", "23", "26", "29", "25", "33", "24", "32", "32", "32"...
$ Tm <chr> "NYK", "MEM", "OKC", "MIN", "TOT", "DEN", "POR", "NOP", "PHI", "NYK", "POR", "IND", "MEM", "DAL", "T...
We’ll use mutate_at to make appropriate columns numeric
bball = bball %>%
mutate_at(vars(-Player, -Pos, -Tm), funs(as.numeric))
glimpse(bball[,1:7])

Observations: 651
Variables: 7
$ Rk <dbl> 1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,...
$ Player <chr> "Quincy Acy", "Jordan Adams", "Steven Adams", "Jeff Adrien", "Arron Afflalo", "Arron Afflalo", "Arro...
$ Pos <chr> "PF", "SG", "C", "PF", "SG", "SG", "SG", "C", "PF", "C", "PF", "C", "SG", "SF", "PF", "PF", "PF", "C...
$ Age <dbl> 24, 20, 21, 28, 29, 29, 29, 26, 23, 26, 29, 25, 33, 24, 32, 32, 32, 36, 32, 21, 26, 20, 30, 32, 32, ...
$ Tm <chr> "NYK", "MEM", "OKC", "MIN", "TOT", "DEN", "POR", "NOP", "PHI", "NYK", "POR", "IND", "MEM", "DAL", "T...
$ G <dbl> 68, 30, 70, 17, 78, 53, 25, 68, 41, 61, 71, 63, 63, 74, 53, 12, 41, 60, 74, 33, 61, 81, 40, 49, 63, ...
$ GS <dbl> 22, 0, 67, 0, 72, 53, 19, 8, 9, 16, 71, 0, 41, 3, 35, 0, 35, 20, 19, 8, 5, 71, 40, 0, 3, 82, 4, 76, ...
mutate takes a vector and returns one of the same dimension
We will contrast it with summarize later
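As a quick preview of that contrast: mutate returns one value per row, while summarize collapses to a single row (per group, if the data are grouped).

```r
library(dplyr)

# mutate: 150 rows in, 150 rows out
iris %>%
  mutate(centered = Sepal.Length - mean(Sepal.Length)) %>%
  nrow  # 150

# summarize: collapses to a single row
iris %>%
  summarize(meanSL = mean(Sepal.Length)) %>%
  nrow  # 1
```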
A common task: creating composites of existing variables
bball = bball %>%
mutate(trueShooting = PTS / (2 * (FGA + (.44 * FTA))),
effectiveFG = (FG + (.5 * X3P)) / FGA,
shootingDif = trueShooting - FG.)
summary(select(bball, shootingDif)) # select and others don't have to be piped to use shootingDif

shootingDif
Min. :-0.08645
1st Qu.: 0.04193
Median : 0.08016
Mean : 0.07855
3rd Qu.: 0.11292
Max. : 0.25000
NA's :2
Note how we use the new variables
We can use them without quotes
We can use any variable as soon as it’s created
Sometimes we want to combine (or split) variables…
unite creates a string combination of others
library(tidyr)
bball %>%
unite("posTeam", Pos, Tm) %>%
select(1:5) %>%
head

Rk Player posTeam Age G
1 1 Quincy Acy PF_NYK 24 68
2 2 Jordan Adams SG_MEM 20 30
3 3 Steven Adams C_OKC 21 70
4 4 Jeff Adrien PF_MIN 28 17
5 5 Arron Afflalo SG_TOT 29 78
6 5 Arron Afflalo SG_DEN 29 53
separate does the opposite
Separate player into first and last names based on the space
bball %>%
separate(Player, into=c('firstName', 'lastName'), sep=' ') %>%
select(1:5) %>%
head

Rk firstName lastName Pos Age
1 1 Quincy Acy PF 24
2 2 Jordan Adams SG 20
3 3 Steven Adams C 21
4 4 Jeff Adrien PF 28
5 5 Arron Afflalo SG 29
6 5 Arron Afflalo SG 29
Data: state.x77
state.x77 %>%
data.frame %>%
mutate(popLog = log(Population),
curLifeExp = Life.Exp+5) %>%
summary

Population Income Illiteracy Life.Exp Murder HS.Grad Frost
Min. : 365 Min. :3098 Min. :0.500 Min. :67.96 Min. : 1.400 Min. :37.80 Min. : 0.00
1st Qu.: 1080 1st Qu.:3993 1st Qu.:0.625 1st Qu.:70.12 1st Qu.: 4.350 1st Qu.:48.05 1st Qu.: 66.25
Median : 2838 Median :4519 Median :0.950 Median :70.67 Median : 6.850 Median :53.25 Median :114.50
Mean : 4246 Mean :4436 Mean :1.170 Mean :70.88 Mean : 7.378 Mean :53.11 Mean :104.46
3rd Qu.: 4968 3rd Qu.:4814 3rd Qu.:1.575 3rd Qu.:71.89 3rd Qu.:10.675 3rd Qu.:59.15 3rd Qu.:139.75
Max. :21198 Max. :6315 Max. :2.800 Max. :73.60 Max. :15.100 Max. :67.30 Max. :188.00
Area popLog curLifeExp
Min. : 1049 Min. :5.900 Min. :72.96
1st Qu.: 36985 1st Qu.:6.984 1st Qu.:75.12
Median : 54277 Median :7.951 Median :75.67
Mean : 70736 Mean :7.863 Mean :75.88
3rd Qu.: 81163 3rd Qu.:8.511 3rd Qu.:76.89
Max. :566432 Max. :9.962 Max. :78.60
Often you do not need the entire data set
Easily handled in base R (as shown earlier)
It can be more clear to use select in dplyr
bball %>%
select(Player, Tm, Pos, MP, trueShooting, effectiveFG, PTS) %>%
summary

Player Tm Pos MP trueShooting effectiveFG
Length:651 Length:651 Length:651 Min. : 1 Min. :0.0000 Min. :0.0000
Class :character Class :character Class :character 1st Qu.: 272 1st Qu.:0.4719 1st Qu.:0.4402
Mode :character Mode :character Mode :character Median : 896 Median :0.5174 Median :0.4799
Mean :1042 Mean :0.5060 Mean :0.4728
3rd Qu.:1674 3rd Qu.:0.5555 3rd Qu.:0.5180
Max. :2981 Max. :1.0638 Max. :1.0000
NA's :2 NA's :2
PTS
Min. : 0.0
1st Qu.: 90.0
Median : 308.0
Mean : 428.1
3rd Qu.: 658.0
Max. :2217.0
What if we want to drop some variables?
scoringDat = bball %>% # creating object as we will come back to it later
select(Player, Tm, Pos, MP, trueShooting, effectiveFG, PTS)
scoringDat %>%
select(-Player, -Tm, -Pos) %>%
cor(use='complete') %>%
round(2)

MP trueShooting effectiveFG PTS
MP 1.00 0.34 0.30 0.93
trueShooting 0.34 1.00 0.96 0.33
effectiveFG 0.30 0.96 1.00 0.27
PTS 0.93 0.33 0.27 1.00
Sometimes, we have a lot of variables to select
If they have a common naming scheme, this can be very easy
bball %>%
select(Player, contains("3P"), ends_with("RB")) %>%
arrange(desc(TRB)) %>%
head

Player X3P X3PA X3P. ORB DRB TRB
1 DeAndre Jordan 1 4 0.250 397 829 1226
2 Andre Drummond 0 2 0.000 437 667 1104
3 Pau Gasol 12 26 0.462 220 699 919
4 Tyson Chandler 0 0 NA 294 570 864
5 Nikola Vucevic 2 6 0.333 238 572 810
6 Rudy Gobert 0 2 0.000 265 510 775
Recall this bit of code?
bball = read_html(url) %>%
html_nodes("#totals_stats") %>%
html_table %>%
data.frame %>%
filter(Rk != "Rk")You will notice the filter line at the end
We sometimes want to see a very specific portion of the data
bball %>%
filter(Age > 35, Pos == "SF" | Pos == "PF")

bball %>%
slice(1:10)

This can be done with things that are created on the fly…
bball %>%
unite("posTeam", Pos, Tm) %>% # create a new variable
filter(posTeam == "PF_SAS") %>% # use it for filtering
arrange(desc(PTS/PF)) %>% # order by on-the-fly operation
select(1:10)

Rk Player posTeam Age G GS MP FG FGA FG.
1 126 Boris Diaw PF_SAS 32 81 15 1984 291 632 0.460
2 56 Matt Bonner PF_SAS 34 72 19 935 94 230 0.409
3 26 Jeff Ayres PF_SAS 27 51 0 383 55 95 0.579
4 187 JaMychal Green PF_SAS 24 4 0 25 4 7 0.571
A brief exercise:
3. (bonus) Redo, but instead filter if the ratio of Petal.Length to Petal.Width is greater than 5.
Which species do these observations belong to?
iris %>%
filter(Petal.Length/Petal.Width > 5) %>%
summary

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.900 Min. :1.100 Min. :0.1000 setosa :34
1st Qu.:4.800 1st Qu.:3.100 1st Qu.:1.400 1st Qu.:0.2000 versicolor: 0
Median :5.000 Median :3.400 Median :1.450 Median :0.2000 virginica : 0
Mean :4.982 Mean :3.382 Mean :1.456 Mean :0.1882
3rd Qu.:5.200 3rd Qu.:3.575 3rd Qu.:1.500 3rd Qu.:0.2000
Max. :5.800 Max. :4.200 Max. :1.900 Max. :0.3000
Sometimes you need to reshape your data
Use the tidyr package
gather will take wide data and melt it into long format
bballLong = bball %>%
select(Player, Tm, FG., X3P., X2P., trueShooting, effectiveFG) %>%
rename(fieldGoalPerc = FG., threePointPerc = X3P., twoPointPerc = X2P.) %>%
gather(key = 'vitalInfo', value = 'value', -Tm, -Player)
bballLong %>% head

Player Tm vitalInfo value
1 Quincy Acy NYK fieldGoalPerc 0.459
2 Jordan Adams MEM fieldGoalPerc 0.407
3 Steven Adams OKC fieldGoalPerc 0.544
4 Jeff Adrien MIN fieldGoalPerc 0.432
5 Arron Afflalo TOT fieldGoalPerc 0.424
6 Arron Afflalo DEN fieldGoalPerc 0.428
Going the reverse direction - use spread
bballLong %>%
spread(vitalInfo, value) %>%
head

Player Tm effectiveFG fieldGoalPerc threePointPerc trueShooting twoPointPerc
1 A.J. Price CLE 0.2647059 0.265 0.000 0.3002183 0.391
2 A.J. Price IND 0.5224719 0.438 0.385 0.5416839 0.480
3 A.J. Price PHO 0.2142857 0.214 0.000 0.2142857 0.429
4 A.J. Price TOT 0.4270073 0.372 0.263 0.4506641 0.450
5 Aaron Brooks CHI 0.4951040 0.421 0.387 0.5338198 0.442
6 Aaron Gordon ORL 0.4783654 0.447 0.271 0.5173735 0.500
A very common task is to look at group-based statistics
We can use group_by and summarize
scoringDat %>%
group_by(Pos) %>% # can group by more than one variable
summarize(meanTrueShooting = mean(trueShooting, na.rm = TRUE)) # potentially any function can work

# A tibble: 11 x 2
Pos meanTrueShooting
<chr> <dbl>
1 C 0.5444475
2 PF 0.5117584
3 PF-SF 0.4561743
4 PG 0.4812083
5 PG-SG 0.5115452
6 SF 0.5089447
7 SF-PF 0.5537975
8 SF-SG 0.5498324
9 SG 0.4948718
10 SG-PG 0.5435740
11 SG-SF 0.4706034
Use do on grouped data to go further
state.x77 %>%
data.frame() %>%
mutate(Region=state.region) %>%
group_by(Region) %>%
do(PopIncCorr=cor(.$Population, .$Income)) %>%
data.frame()

Region PopIncCorr
1 Northeast 0.4216813
2 South 0.2901941
3 North Central 0.2940437
4 West 0.1684014
Use do on grouped data to go further
library(nycflights13)
carriers = group_by(flights, carrier)
group_size(carriers)

[1] 18460 32729 714 54635 48110 54173 685 3260 342 26397 32 58665 20536 5162 12275 601
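Incidentally, dplyr's count collapses the group_by + group_size pattern into a single step:

```r
library(dplyr)
library(nycflights13)

# flights per carrier, largest first
flights %>%
  count(carrier) %>%
  arrange(desc(n))
```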
## mods = do(carriers, mod = lm(arr_delay ~ dep_time, data = .)) # reminder that data frames are lists
## sapply(mods$mod, summary)

?state.x77
Using one pipe sequence
state.x77 %>%
data.frame %>%
mutate(Region = state.region,
State = state.name) %>%
filter(Population > 1000) %>%
select(Region, starts_with('I')) %>%
group_by(Region) %>%
summarize(meanInc=mean(Income),
meanIll=mean(Illiteracy))

# A tibble: 4 x 3
Region meanInc meanIll
<fctr> <dbl> <dbl>
1 Northeast 4731.0 1.066667
2 South 3958.8 1.793333
3 North Central 4607.9 0.710000
4 West 4525.0 1.085714
Recap thus far:
%>% : Passes the prior object to the function after the pipe
Example:
iris %>% head

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
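With longer chains the gain is clearer: a nested call reads inside out, while the piped version reads top to bottom. A minimal sketch (computing the geometric mean of mpg, purely as an illustration):

```r
library(magrittr)  # provides %>% (also re-exported by dplyr)

# nested: read from the inside out
round(exp(mean(log(mtcars$mpg))), 2)

# piped: read left to right, top to bottom
mtcars$mpg %>%
  log %>%
  mean %>%
  exp %>%
  round(2)
```

Both return the same number; only the order in which you read the operations changes.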
Other pipes are available with magrittr and other packages
%$%: Exposes the names in the prior to the function after
Example (no data argument needed for lm):
iris %$% lm(Sepal.Length ~ Sepal.Width)
Call:
lm(formula = Sepal.Length ~ Sepal.Width)
Coefficients:
(Intercept) Sepal.Width
6.5262 -0.2234
%T>% : Passes the prior object to the next function (for a side effect) and also on to subsequent steps
Example:
iris %>% select(Sepal.Length, Sepal.Width) %T>% plot %>% summary

Sepal.Length Sepal.Width
Min. :4.300 Min. :2.000
1st Qu.:5.100 1st Qu.:2.800
Median :5.800 Median :3.000
Mean :5.843 Mean :3.057
3rd Qu.:6.400 3rd Qu.:3.300
Max. :7.900 Max. :4.400
Unfortunately the T pipe does not allow for some things
iris %>% select(Sepal.Length, Sepal.Width) %T>% plot %>% summary
iris %>% select(Sepal.Length, Sepal.Width) %T>% summary %>% plot

%<>% : assigns to the former object the result of the operations that follow
x = c(1:3,9:7,10,4:6)
x %<>% .^2 %>% sort
x

[1] 1 4 9 16 25 36 49 64 81 100
Piping is not limited to data management functions
Any R function can be potentially piped to
data.frame(y=rnorm(100), x=rnorm(100)) %$%
lm(y ~ x)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-0.05802 0.13975
This facilitates data exploration
Many newer interactive visualization packages take advantage of piping
htmlwidgets makes it easy to create js-based visualizations
leaflet
dygraphs
networkD3
DT
rthreejs
Use leaflet for quick and easy maps
leaflet() %>%
setView(lat=42.2655, lng=-83.7485, zoom=15) %>%
addTiles() %>%
addPopups(lat=42.2655, lng=-83.7485, 'Hi!')

dygraphs focuses on time-series objects
library(dygraphs)
data(UKLungDeaths)
cbind(ldeaths, mdeaths, fdeaths) %>%
dygraph()

Use networkD3 for interactive networks
library(networkD3)
forceNetwork(Links = netlinks, Nodes = netnodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
Nodesize = "size", Group = "group", opacity = 0.4, legend = T,
colourScale = JS("d3.scale.category10()"), fontSize=24)

The DT package provides interactive data tables
library(DT)
datatable(select(bball, 1:5), rownames=F)

Plotly is a cross-platform interactive visualization library
It also works with JS, Python, Matlab and others
Reminder of what’s in the data:
bballLong %>% head

Player Tm vitalInfo value
1 Quincy Acy NYK fieldGoalPerc 0.459
2 Jordan Adams MEM fieldGoalPerc 0.407
3 Steven Adams OKC fieldGoalPerc 0.544
4 Jeff Adrien MIN fieldGoalPerc 0.432
5 Arron Afflalo TOT fieldGoalPerc 0.424
6 Arron Afflalo DEN fieldGoalPerc 0.428
Plotly works by starting with a base layer…
to which subsequent layers are added…
with additional options if needed.
Layers are added through this same piping system.
library(plotly)
bballLong %>%
group_by(Tm, vitalInfo) %>%
summarize(avg = mean(value)) %>%
plot_ly(width=700) %>%
add_markers(x=~Tm, y=~avg, color =~vitalInfo) %>%
layout() # options go here

Using plotly and the mtcars data set, we’ll create a grouped scatterplot without creating any new objects.
Now, starting with the mtcars data set:
prediction = predict(lm(mpg~wt, data=mtcars))
library(plotly)
mtcars %>%
mutate(amFactor = factor(am, labels=c('auto', 'manual'))) %>%
plot_ly(width=300, height=200) %>%
add_markers(~wt, ~mpg, color=~amFactor, alpha=.5, hover=c(~wt, ~mpg)) %>%
add_lines(x=~wt, y=prediction, type=2, name='prediction')

Note that much of the functionality you see is in base R
Now you have more flexible approaches
Think of these packages as exploratory tools
Use them to bring clarity to code
Use them to explore your data more easily for visualization and modeling
With more use, the easier it gets, and the more you can do
Michael Clark
Consulting for Statistics, Computing & Analytics Research
Advanced Research Computing
University of Michigan
With notable assistance from:
Seth Berry
Mendoza College of Business
University of Notre Dame